Fix mlnet performance benchmark timeouts by pre-provisioning the SSWE model into the Helix payload #5250
Open
LoopedBard3 wants to merge 3 commits into
The mlnet performance benchmarks (StochasticDualCoordinateAscentClassifierBench.TrainSentiment) apply a pretrained SSWE word embedding that ML.NET downloads (~70 MB) from aka.ms/mlnet-resources at benchmark runtime. That download stalls on the Helix machines, hanging the entire mlnet work item until it times out and is killed, discarding all mlnet results so every mlnet benchmark appears to fail. Download the model on the build agent (reliable connectivity) into the correlation payload and point MICROSOFTML_RESOURCE_PATH at it via the Helix pre-commands, removing the runtime network dependency. Best-effort and strictly gated on run_kind == mlnet, so non-mlnet runs are unaffected and a download failure falls back to prior behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR addresses persistent ML.NET benchmark work item timeouts in the performance-ci Helix pipeline by removing a runtime network dependency (ML.NET’s SSWE embedding download) and instead pre-staging the model inside the Helix correlation payload.
Changes:
- Added a best-effort pre-provisioning step that downloads sentiment.emd (SSWE embedding) into the correlation payload for run_kind == "mlnet".
- Wired Helix pre-commands to set MICROSOFTML_RESOURCE_PATH to the staged payload directory when provisioning succeeds.
- Added retry + fallback URL logic for the model download.
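A minimal sketch of how the retry + fallback logic and the pre-command wiring listed above could fit together. It is not the code in this PR: the function names, the injected fetch callable, the attempt count, and the use of the HELIX_CORRELATION_PAYLOAD variable in the pre-command are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def provision_model(
    urls: Sequence[str],
    dest_path: str,
    fetch: Callable[[str, str], None],
    attempts: int = 3,
) -> bool:
    """Try each URL in order, retrying each one up to `attempts` times.

    `fetch(url, dest_path)` is a hypothetical downloader that raises OSError
    on failure. Returns True on the first success and False if every URL
    fails, so the caller can log a warning and keep the prior behavior.
    """
    for url in urls:
        for attempt in range(1, attempts + 1):
            try:
                fetch(url, dest_path)
                return True
            except OSError as exc:
                print(f"warning: attempt {attempt}/{attempts} for {url} failed: {exc}")
    return False


def resource_path_pre_commands(
    is_windows: bool, payload_subdir: str = "mlnet-resources"
) -> List[str]:
    """Build a pre-command that points ML.NET at the staged directory.

    Assumes the staged model lives under `payload_subdir` inside the Helix
    correlation payload and that HELIX_CORRELATION_PAYLOAD is set on the
    Helix machine.
    """
    if is_windows:
        return [f"set MICROSOFTML_RESOURCE_PATH=%HELIX_CORRELATION_PAYLOAD%\\{payload_subdir}"]
    return [f"export MICROSOFTML_RESOURCE_PATH=$HELIX_CORRELATION_PAYLOAD/{payload_subdir}"]
```

In this sketch the pre-commands would only be appended when provision_model returns True, which matches the "when provisioning succeeds" condition above: a failed download leaves the environment untouched.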
Download to a temp file, validate the size against Content-Length (when present) and a minimum-size floor, then atomically replace the destination so a truncated or early-closed response can't leave a corrupt sentiment.emd in the payload. Also make the function docstring more concise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the server sends a Content-Length, an exact match fully validates the download, so trust it regardless of size (the asset could legitimately shrink without becoming invalid). Only fall back to the minimum-size floor when no Content-Length is available. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
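The validation described in the two commits above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the PR's actual code: the function names, the 1 MiB floor, the chunk size, and the timeout are hypothetical values chosen for the example.

```python
import os
import tempfile
import urllib.request
from typing import Optional

MIN_MODEL_BYTES = 1024 * 1024  # hypothetical floor, used only without Content-Length


def size_is_valid(actual: int, content_length: Optional[int],
                  min_bytes: int = MIN_MODEL_BYTES) -> bool:
    """Trust an exact Content-Length match; otherwise require the floor."""
    if content_length is not None:
        return actual == content_length
    return actual >= min_bytes


def download_validated(url: str, dest_path: str, timeout: float = 60.0) -> None:
    """Download to a temp file, validate its size, then atomically replace dest."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temp file next to the destination so os.replace stays on
    # one filesystem and the swap is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out, \
                urllib.request.urlopen(url, timeout=timeout) as resp:
            header = resp.headers.get("Content-Length")
            expected = int(header) if header is not None else None
            written = 0
            while True:
                chunk = resp.read(1024 * 1024)
                if not chunk:
                    break
                out.write(chunk)
                written += len(chunk)
        if not size_is_valid(written, expected):
            raise OSError(f"size check failed: got {written} bytes, expected {expected}")
        os.replace(tmp_path, dest_path)
    finally:
        # On any failure the partial file is removed and dest_path is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the destination is only ever written via os.replace, a truncated or early-closed response leaves either the previous file or no file at that path, never a partial sentiment.emd.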
Problem
Every ML.NET performance benchmark in the performance-ci pipeline (public definitionId=38) is timing out. The whole mlnet work item hangs and is eventually killed at the work item timeout, discarding all ML.NET results, so every mlnet test shows as failed.
Example failing run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1478345
Root cause
The culprit is StochasticDualCoordinateAscentClassifierBench.TrainSentiment, which applies a pretrained SSWE word embedding. ML.NET downloads that model (sentiment.emd, ~70 MB) from aka.ms/mlnet-resources at benchmark runtime if it isn't already on disk. On the Helix machines that download now stalls (a hung connection, not a fast failure), so the benchmark hangs at // BeforeActualRun and the work item runs until it times out.
I reproduced this locally: with the model download blocked (stalling proxy) and the cache cleared, TrainSentiment hangs at // BeforeActualRun exactly like the Helix log; pre-provisioning the model + setting MICROSOFTML_RESOURCE_PATH runs it to completion with zero network. Updating Microsoft.ML (tested 5.0.0) does not help: the runtime download persists in all versions.
Fix
In scripts/run_performance_job.py, for run_kind == "mlnet" only:
- Download the model on the build agent into the correlation payload at mlnet-resources/Text/Sswe/sentiment.emd, with retries and a blob→aka.ms fallback.
- Set MICROSOFTML_RESOURCE_PATH to that payload dir via the Helix pre-commands, so ML.NET loads the embedding from disk and never makes the network call.
This is best-effort and strictly gated on run_kind == "mlnet": non-mlnet runs are unaffected, and if the agent download fails it logs a warning and falls back to today's behavior.
Why now?
The benchmark hasn't changed since 2019 and this isn't caused by any PR — the trigger is environmental on the download path (Helix egress/proxy/TLS tightening, blob throttling, and/or a .NET 9+ HttpClient behavior change against this endpoint — .NET 8 sometimes completed while 9.0/main/ubuntu hung). It manifests as a multi-hour timeout because it's a stalled read rather than a fast failure. Pre-staging the asset removes the dependency regardless of which is the actual culprit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>